Virtual Machine On VMware ESXi Hypervisor Will Stop Responding or Fail to Power On When Configured With the NVIDIA A40/A10 PCIe Graphics Accelerator As a "Passthrough" Device


Article ID: 319788


Updated On: 11-13-2023

Products

VMware vSphere ESXi

Issue/Introduction

A VM with a GPU configured in passthrough mode fails to start on the following HPE server models:

 

  • HPE Apollo 6500 Gen10 Plus System.
  • HPE ProLiant DL380 Gen10 server.
  • HPE ProLiant DL380 Gen10 Plus server.
  • HPE ProLiant DL385 Gen10 Plus server.
  • HPE ProLiant DL385 Gen10 server.
  • HPE ProLiant XL645d Gen10 Plus Configure-to-order Server.
  • HPE ProLiant XL675d Gen10 Plus Configure-to-order Server.


Symptoms:

On ESXi 7.0 Update 3n, a VM with a GPU attached via passthrough fails to start.

The VM fails to start with the error: Module DevicePowerOn power on failed. Failed to start the virtual machine. Device 6:0.0 is not a passthrough device.

The vmkernel log may also show PCIe hot-plug events during VM power-on, such as the following:
2023-07-31T05:51:43.091Z cpu144:15901592)PCIPassthru: 3873: pcipDevInfo(0x43174d483300) allocated for 0000:a9:00.0
2023-07-31T05:51:43.093Z cpu98:2097948)PCIEHP: 1564: 0000:a8:01.0: hotplug slot:0x1: num reads=1 slot status=0x108.
2023-07-31T05:51:43.093Z cpu98:2097948)PCIEHP: 1496: 0000:a8:01.0: hotplug slot:0x1 (0000:a9:00.0) Adapter removed.
2023-07-31T05:51:43.093Z cpu98:2097948)PCIEHP: 380: 0000:a8:01.0: hotplug slot:0x1: Setting PowerIndicator State BLINKING
2023-07-31T05:51:43.094Z cpu98:2097948)PCIEHP: 1048: 0000:a8:01.0: Disabling hotplug slot:0x1
2023-07-31T05:51:45.234Z cpu3:2097947)PCIEHP: 1477: 0000:a8:01.0: hotplug slot:0x1 (0000:a9:00.0) Adapter inserted.
2023-07-31T05:51:45.337Z cpu3:2097947)PCIEHP: 380: 0000:a8:01.0: hotplug slot:0x1: Setting PowerIndicator State BLINKING
2023-07-31T05:51:45.338Z cpu3:2097947)PCIEHP: 982: 0000:a8:01.0: Enabling hotplug slot:0x1
2023-07-31T05:51:45.348Z cpu3:2097947)AMDIommu: 996: IOMMU 0000:a0:00.2: Prepared IOMMU for hotplug device 0000:a9:00.0
2023-07-31T05:51:45.348Z cpu3:2097947)WARNING: PCIEHP: 641: 0000:a8:01.0: hotplug slot: 0x1: Device insertion detected while prior device 0000:a9:00.0 removal is still pending
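To check a host for the same pattern, the removal and insertion events can be filtered out of the vmkernel log. The following is a minimal sketch, not an official diagnostic: the function name find_hotplug_events is hypothetical, the default log path /var/log/vmkernel.log is an assumption about where the log resides on the host, and the match strings are taken from the log excerpt above.

```shell
# Print PCIe hot-plug removal/insertion events from a vmkernel log.
# The function name is illustrative; the default path is an assumption.
find_hotplug_events() {
    log="${1:-/var/log/vmkernel.log}"
    # Match the PCIEHP messages seen in the excerpt above.
    grep -E 'PCIEHP: .*(Adapter removed|Adapter inserted|removal is still pending)' "$log"
}
```

Calling find_hotplug_events with no argument reads the assumed default path. Pass a file name to scan a copy of the log taken from a support bundle.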
 


Environment

VMware vSphere ESXi 7.0

Cause

This is a known issue on the affected HPE servers. See the HPE advisory: https://support.hpe.com/hpesc/public/docDisplay?docId=a00121002en_us

Resolution

To avoid this issue, disable PCIe device hot-plug on the VMware ESXi host installed on the server:

 

1. On the bare-metal ESXi host, run the command:

  • esxcli system settings kernel set -s enablePCIEHotplug -v FALSE

2. Reboot the ESXi host.

 

3. Verify that PCIe device hot-plug is disabled by running the command:

  • esxcli system settings kernel list -o enablePCIEHotplug

4. Confirm that the value FALSE is displayed under the Runtime column.

 

5. With this setting in place, the VMs function properly when running the GPUs in VMware passthrough mode.
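The verification in steps 3 and 4 can be scripted. The following is a rough sketch only: the helper name runtime_value is hypothetical, and it assumes that the command output is a whitespace-separated table whose header row starts with Name and includes a Runtime column. Adjust it if the output on your host is laid out differently.

```shell
# Read the output of the list command on stdin and print the Runtime value
# for the enablePCIEHotplug row. The table layout is an assumption.
# Intended use on the host:
#   esxcli system settings kernel list -o enablePCIEHotplug | runtime_value
runtime_value() {
    awk '
        # Header row: record which field is the Runtime column.
        $1 == "Name" { for (i = 1; i <= NF; i++) if ($i == "Runtime") col = i }
        # Setting row: print the field in that column.
        $1 == "enablePCIEHotplug" && col { print $col }
    '
}
```

After the reboot, the helper should print FALSE if the setting has taken effect.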

 


Additional Information

Impact/Risks:

The VM fails to power on.